Method for processing an audio sequence, for example a piece of music

ABSTRACT

A method of processing a sound sequence corresponding in particular to a piece of music comprising a succession of subsequences from among at least an introduction, a verse, a refrain, a bridgeway, a theme, a motif and a movement, in which: a) a spectral transform is applied to said sequence to obtain spectral coefficients varying as a function of time in said sequence, b) at least one subsequence repeated in said sequence is determined by statistical analysis of said spectral coefficients, and c) start and end instants of a first subsequence, such as a verse, and of a second subsequence, such as a refrain, are evaluated so as to substantially concatenate the first subsequence with the second subsequence.

The present invention relates to the processing of a sound sequence, such as a piece of music or, more generally, a sound sequence comprising the repetition of a subsequence.

Distributors of musical productions, for example recorded on CD, cassette or other medium, make booths available to potential customers where the customers can listen to music of their choice, or else music promoted on account of its novelty. When a customer recognizes a verse or a refrain from the piece of music to which he is listening, he can decide to purchase the corresponding musical production.

More generally, a moderately attentive listener concentrates his attention more on a verse and refrain strung together than on the introduction of the piece, in particular. It will thus be understood that a sound resume comprising at least one verse and one refrain would suffice for dissemination among booths of the aforesaid type, rather than providing for the complete musical production to be disseminated.

In another application such as the transmission of sound data by mobile telephone, it will be understood that the downloading of the complete piece of music onto a mobile terminal, from a remote server, is much lengthier and, therefore, more expensive than the downloading of a sound resume of the aforesaid type.

Likewise, in an electronic commerce context, sound resumes may be downloaded onto a facility communicating with a remote server, via an extended network of the INTERNET type. The user of the computer facility may thus place an order for a musical production whose sound resume he likes.

However, detecting a verse and a refrain by ear and thus creating a sound resume for all the musical productions distributed would be a prohibitively cumbersome task.

The present invention aims to improve the situation.

One of the aims of the present invention is to propose an automated detection of a subsequence repeated in a sound sequence.

Another aim of the present invention is to propose an automated creation of sound resumes of the type described above.

For this purpose, the present invention pertains firstly to a method of processing a sound sequence, in which:

a) a spectral transform is applied to said sequence to obtain spectral coefficients varying as a function of time in said sequence.

The method within the sense of the invention furthermore comprises the following steps:

b) at least one subsequence repeated in said sequence is determined by statistical analysis of said spectral coefficients, and

c) start and end instants of said subsequence in the sound sequence are evaluated.

Advantageously, according to an additional step:

d) the aforesaid subsequence is extracted so as to store, in a memory, sound samples representing said subsequence.

Preferably, the extraction of step d) relates to at least one subsequence whose duration is the biggest and/or one subsequence whose frequency of repetition is the biggest in said sequence.

The present invention finds an advantageous application in aiding the detection of failures of industrial machines or motors, especially by obtaining sound recording sequences of phases of acceleration and of deceleration of the motor speed. The application of the method within the sense of the invention makes it possible to isolate a sound subsequence corresponding for example to a steady speed or to an acceleration phase, this subsequence being, as the case may be, compared with a reference subsequence.

In another advantageous application to the obtaining of musical data of the type described above, the sound sequence is a piece of music comprising a succession of subsequences from among at least an introduction, a verse, a refrain, a bridgeway, a theme, a motif, or a movement which is repeated in the sequence. In step c), at least the respective start and end instants of a first subsequence and of a second subsequence are determined.

In a particularly advantageous embodiment, in step d), a first and a second subsequence are extracted so as to obtain, on a memory medium, a sound resume of said piece of music comprising at least the first subsequence strung together with the second subsequence.

Preferably, the first subsequence corresponds to a verse and the second subsequence corresponds to a refrain.

However, it may happen that a first and a second subsequence that are extracted from a sound sequence are not contiguous in time.

To handle this case, the following steps are moreover provided:

d1) detecting at least one cadence of the first subsequence and/or of the second subsequence so as to estimate the mean duration of a bar at said cadence, as well as at least one end segment of the first subsequence and at least one start segment of the second subsequence, of respective durations corresponding substantially to said mean duration and isolated in the sequence by an integer number of mean durations,

d2) generating at least one bar of transition of duration corresponding to said mean duration and comprising an addition of the sound samples of at least said end segment and of at least said start segment,

d3) and concatenating the first subsequence, the transition bar or bars and the second subsequence to obtain a stringing together of the first and of the second subsequence.

It will be noted that the succession of steps d1) to d3) finds, over and above the automatic generation of sound resumes, an advantageous application to computer assisted musical creation. In this application, a user can himself create two subsequences of a piece of music, whereas software comprising instructions for running steps d1) to d3) provides for the stringing together of the two subsequences by concatenation, without artefacts and in a manner pleasant to the ear.

More generally, the present invention is also aimed at a computer program product, stored in a computer memory or on a removable medium able to cooperate with a computer reader, and comprising instructions for running the steps of the method within the sense of the invention.

Other characteristics and advantages of the invention will become apparent on examining the detailed description hereinbelow, and the appended drawings in which:

FIG. 1 a represents an audio signal of a piece of music corresponding, in the example represented, to a light popular song;

FIG. 1 b represents the variation in spectral energy as a function of time, for the piece of music whose audio signal is represented in FIG. 1 a;

FIG. 1 c illustrates the durations occupied by the various passages of the piece of music of FIG. 1 a and which repeat in this piece;

FIG. 2 diagrammatically represents time windows selected from two respective parts of the piece of music so as to prepare the concatenation of these two parts, according to the succession of steps d1) to d3) hereinabove;

FIG. 3 a diagrammatically represents segments s_(i)(t) and s_(j)(t) selected from the aforesaid respective parts of the piece, so as to prepare a concatenation of the two parts by superposition/addition;

FIG. 3 b diagrammatically illustrates by the sign “⊕” the aforesaid superposition/addition;

FIG. 4 illustrates a time window for the aforesaid concatenation, of preferred shape and preferred width; and

FIG. 5 represents a flowchart for processing a sound sequence, in a preferred embodiment of the present invention.

The audio signal of FIG. 1 a represents the sound intensity (ordinate) as a function of time (abscissa) of a piece of music (here, the piece “Head Over Feet”© by the artist Alanis Morissette). To construct this audio signal, the respective signals of the right and left channels (in stereophonic mode) have been synchronized and added together.

A spectral transform (for example of the fast Fourier transform, or FFT, type) is applied to the audio signal represented in FIG. 1 a to obtain a temporal variation of the spectral energy of the type represented in FIG. 1 b.

In an embodiment, one is concerned with a plurality of successive short-term FFTs, the result of which is applied to a bank of filters over several ranges of frequencies (preferably of bandwidths that increase like the logarithm of the frequency). Another Fourier transform is then applied to obtain dynamic parameters of the audio signal (which are referenced PD in FIG. 1 b). In particular, the ordinate scale of FIG. 1 b indicates the amplitude of the variations of the components at various rates in a given frequency domain. Thus, the index 0 or 2 of the arbitrary ordinate scale of FIG. 1 b corresponds to a slow variation in the low frequencies, while the index 12 of this same scale corresponds to a fast variation in the high frequencies. These variations are expressed as a function of time, along the abscissa (seconds). The intensities associated with these dynamic parameters PD, over time, are illustrated by various gray levels whose relative values are indicated by the reference column COL (on the right in FIG. 1 b).
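The first stage of this analysis (successive short-term transforms followed by a bank of bands that widen with frequency) can be sketched as follows. This is only an illustrative sketch: the function names, the frame length, the hop size, the number of bands and the geometric spacing of the band edges are all assumptions, and a naive DFT stands in for an FFT so that the example needs only the standard library. The second Fourier transform that yields the dynamic parameters PD is not shown.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0..N/2 of one frame (naive O(N^2) transform)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags


def log_band_edges(n_bins, n_bands):
    """Bin indices delimiting bands whose widths grow roughly geometrically.

    Bin 0 (the DC component) is deliberately left out of every band.
    """
    edges = [1]
    for b in range(1, n_bands + 1):
        edge = round(n_bins ** (b / n_bands))
        edges.append(max(edge, edges[-1] + 1))
    return edges


def band_energies(signal, frame_len=64, hop=32, n_bands=4):
    """One row per short-term frame, one column per frequency band."""
    edges = log_band_edges(frame_len // 2 + 1, n_bands)
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        mags = dft_magnitudes(signal[start:start + frame_len])
        rows.append([sum(m * m for m in mags[lo:hi])
                     for lo, hi in zip(edges, edges[1:])])
    return rows
```

Each row of the result describes one instant of the sequence, so the rows taken together form a time-varying set of spectral coefficients of the kind the statistical analysis operates on.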

It is indicated that the dynamic parameters of the type represented in FIG. 1 b make it possible to identify a piece of music completely. In this context of “imprint” of a piece of music, patent application FR-2834363 from the applicant describes in a detailed manner these parameters and the way of obtaining them.

As a variant, the variables deduced from the audio signal and making it possible to characterize the piece of music may be of different type, in particular so-called “Mel Frequency Cepstral Coefficients”. Globally, it is indicated that these coefficients (known per se) are likewise obtained by a short-term fast Fourier transform.

FIG. 1 c offers a visual representation of the profile of the spectral energy of FIG. 1 b. In FIG. 1 c, the abscissa represents time (in seconds) and the ordinates represent the various parts of the piece, such as the verses, the refrains, the introduction, a theme, or the like. The repetition over time of a similar part, such as a verse or a refrain, is represented by hatched rectangles which appear at various abscissae over time (and which may be of different temporal widths), but of like ordinates. To go from the representation of FIG. 1 b to the representation of FIG. 1 c, a statistical analysis is implemented using for example the “K-means” algorithm, or else the “fuzzy K-means” algorithm, or else a hidden Markov chain, with learning by the Baum-Welch algorithm, followed by an evaluation by the Viterbi algorithm.

Typically, the determination of the number of states (the parts of the piece of music) which are necessary for the representation of a piece of music is performed in an automated manner, by comparison of the similarity of the states found at each iteration of the aforesaid algorithms, and by eliminating the redundant states. This technique, termed “pruning”, thus makes it possible to isolate each redundant part of the piece of music and to determine its temporal coordinates (its start and end instants, as indicated hereinabove).
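By way of illustration, the passage from per-frame coefficients to labelled passages with start and end instants can be sketched with a plain K-means, one of the algorithms named above. This is a minimal sketch under stated assumptions: the function names, the fixed number of states k, the evenly spaced seeding of the centroids and the fixed frame period are hypothetical choices, and the pruning of redundant states is not shown.

```python
def kmeans_labels(frames, k, n_iter=20):
    """Assign each feature vector to one of k states with a plain K-means.

    Centroids are seeded with evenly spaced frames (an assumption made here
    to keep the sketch deterministic).
    """
    if k < 1 or k > len(frames):
        raise ValueError("k must be between 1 and the number of frames")
    step = len(frames) // k
    centroids = [list(frames[i * step]) for i in range(k)]
    labels = [0] * len(frames)
    for _ in range(n_iter):
        # Assignment step: nearest centroid in squared Euclidean distance.
        for idx, frame in enumerate(frames):
            dists = [sum((a - b) ** 2 for a, b in zip(frame, c)) for c in centroids]
            labels[idx] = dists.index(min(dists))
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [f for f, lab in zip(frames, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def label_runs(labels, frame_period):
    """Turn per-frame labels into (state, start_seconds, end_seconds) segments."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start * frame_period, i * frame_period))
            start = i
    return segments
```

A state that appears in several segments corresponds to a passage that repeats in the sequence, and the segment boundaries give its start and end instants.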

Thus, one studies the variations, for example in the tonal frequencies (of a human voice), of the spectral energy to determine the repetition of a particular musical passage in the audio signal.

Preferably, one seeks to extract one or more musical passages whose duration is the biggest in the piece of music and/or whose frequency of repetition is the biggest.
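This selection criterion can be sketched as follows, assuming the analysis has already produced a list of (state, start, end) segments. The function name and the segment format are hypothetical, and restricting the search to states that occur at least twice is an assumption drawn from the requirement that the extracted passages repeat.

```python
from collections import defaultdict


def rank_repeated_states(segments):
    """Return (most_repeated_state, longest_repeated_state).

    `segments` is a list of (state, start_seconds, end_seconds) tuples.
    Only states that occur at least twice are considered; (None, None) is
    returned when nothing repeats.
    """
    stats = defaultdict(lambda: [0, 0.0])  # state -> [occurrences, longest duration]
    for state, start, end in segments:
        stats[state][0] += 1
        stats[state][1] = max(stats[state][1], end - start)
    repeated = {s: v for s, v in stats.items() if v[0] >= 2}
    if not repeated:
        return None, None
    most_repeated = max(repeated, key=lambda s: repeated[s][0])
    longest = max(repeated, key=lambda s: repeated[s][1])
    return most_repeated, longest
```

With segment times invented purely for illustration, `rank_repeated_states([(2, 0, 15), (3, 15, 40), (1, 40, 60), (3, 60, 85), (1, 85, 100), (1, 100, 120)])` picks state 1 as the most frequently repeated and state 3 as the longest repeated passage.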

For example, for most light popular pieces, it will be possible to choose to isolate the refrain parts, whose repetition is generally the most frequent, and then the verse parts, whose repetition is frequent, then, as the case may be, other parts again if they repeat.

It is indicated that other types of subsequences representative of the piece of music may be extracted, provided that these subsequences repeat in the piece of music. For example, it is possible to choose to extract a musical motif, generally of shorter duration than a verse or a refrain, such as a passage of percussion repeated in the piece of music, or else a vocal phrase chanted several times in the piece. Furthermore, a theme may also be extracted from the piece of music, for example a musical phrase repeated in a piece of jazz or of classical music. In classical music, a passage such as a movement may moreover be extracted.

In the visual resume represented by way of example in FIG. 1 c, the hatched rectangles indicate the presence of a part of the piece such as the introduction (“intro”), of a verse or of a refrain in a time window indicated by the temporal abscissa (in seconds). Thus, between 0 and around 15 seconds, the piece of music begins with an introduction (indexed by the digit 2 on the ordinate scale). The introduction is followed by two alternations of a verse (indexed by the digit 3) and of a refrain (indexed by the digit 1) up to around 100 seconds.

Reference is now made to FIG. 5 to describe the main steps of the method for obtaining the aforesaid sound resume, according to a preferred embodiment. Firstly, the audio signals are obtained on the left channel “audio L” and on the right channel “audio R” in the respective steps 10 and 11, when the initial sound sequence is represented in stereophonic mode. The signals of these two channels are added together in step 12 to obtain an audio signal of the type represented in FIG. 1 a. This audio signal is, as the case may be, stored in sampled form in a work memory with sound intensity values ranked as a function of their associated temporal coordinates (step 14). A spectral transform (of FFT type in the example represented) is applied to these audio data in step 16 to obtain, in step 18, the spectral coefficients F_(i)(t) and/or their variation ΔF_(i)(t) as a function of time. In step 20, a statistical analysis module operates on the basis of the coefficients obtained in step 18 to isolate instants t₀, t₁, . . . , t₇ which correspond to start and end instants of the various subsequences which repeat in the audio signal of step 14.

In the example represented, the piece of music exhibits a structure (classic in light popular music) of the type comprising:

- an introduction at the start of the piece between an instant t₀ and an instant t₁,
- a verse between t₁ and t₂,
- a refrain between t₂ and t₃,
- a second verse between t₃ and t₄,
- a second refrain between t₄ and t₅,
- an introduction, again, as the case may be supplemented with an instrumental solo, between the instants t₅ and t₆, and
- the repetition of two end-of-piece refrains between the instants t₆ and t₇.

In step 22, the instants t₀ to t₇ are catalogued and indexed as a function of the corresponding musical passage (introduction, verse or refrain) and stored, as the case may be, in a work memory. In step 23, it is then possible to construct a visual resume of this piece of music, as represented in FIG. 1 c.

In the example described hereinabove of a light popular piece comprising a typical structure, the sound resume is constructed from a verse extracted from the piece, followed by a refrain extracted from the piece. In step 24, a concatenation is prepared of the sound samples of the audio signal between the instants t₁ and t₂, on the one hand, and between the instants t₂ and t₃, on the other hand, in the example described. As the case may be, the result of this concatenation is stored in a permanent memory MEM for subsequent use, in step 26.
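The extraction and stringing together of step 24 can be sketched as plain slicing of the stored samples. The function names, the representation of the signal as a list of samples and the (start, end) pairs in seconds are assumptions made for illustration; this sketch performs a direct concatenation only, with no transition handling between passages.

```python
def extract(samples, sample_rate, start_s, end_s):
    """Samples of the passage lying between two instants given in seconds."""
    return samples[round(start_s * sample_rate):round(end_s * sample_rate)]


def sound_resume(samples, sample_rate, verse, refrain):
    """Directly concatenate a verse and a refrain, each a (start_s, end_s) pair."""
    return (extract(samples, sample_rate, *verse)
            + extract(samples, sample_rate, *refrain))
```

When the verse ends exactly where the refrain starts (t₂ in the example), the result is simply the contiguous excerpt between t₁ and t₃.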

However, as a general rule, the end instant of an isolated verse and the start instant of an isolated refrain are not necessarily identical, or else, one may choose to construct the sound resume from the first verse and the second refrain (between t₄ and t₅) or from the end refrain (between t₆ and t₇). Thus, the two passages selected to construct the sound resume are not necessarily contiguous.

A blind concatenation of sound signals corresponding to two parts of a piece of music gives an impression unpleasant to the ear. Hereinbelow is described, with reference to FIGS. 2, 3 a, 3 b and 4, the construction of a sound signal by concatenation of two parts of a piece of music, in such a way as to overcome this problem.

One of the aims of this construction by concatenation is to locally preserve the tempo of the sound signal.

Another aim is to ensure a temporal distance between points of concatenation (or points of “alignment”) that is equal to an integer multiple of the duration of a bar.

Preferably, this concatenation is performed by superposition/addition of sound segments chosen and isolated from the two abovementioned respective parts of the piece of music.

Described below is a superposition/addition of such sound segments, firstly by beat synchronization (termed “beat-synchronous”), then by bar synchronization according to a preferred embodiment.

The following notation applies:

- bpm, the number of beats per minute of a piece of music,
- D, the reference of this number bpm (for example in the case of a piece denoted “120=crotchet”, bpm=120 and D=crotchet),
- T, the duration (expressed in seconds) of a beat, that is to say of the reference D: in the above example where D=crotchet, we have $T = \frac{60}{bpm}$,
- N, the numerator of the metric of the piece of music (for example, in the case of a bar denoted “¾”, N=3),
- M, the duration (expressed in seconds) of a bar, given by the relation M=N·T (i.e. M=3×60/120=1.5 seconds in the above example),
- s(t), the audio signal of a piece of music,
- ŝ(t), the signal reconstructed by superposition/addition, and
- s_(i)(t) and s_(j)(t), the i^(th) and j^(th) segments which comprise respective audio signals belonging to a first and to a second passage of a piece of music, and which are used for the construction of ŝ(t) by superposition/addition.
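As a small worked example of the relations T = 60/bpm and M = N·T, the following sketch computes the beat and bar durations; the function name is hypothetical.

```python
def beat_and_bar_durations(bpm, numerator):
    """Return (T, M): the beat duration and the bar duration, in seconds."""
    beat = 60.0 / bpm          # T = 60 / bpm
    bar = numerator * beat     # M = N * T
    return beat, bar
```

For the example used in the notation (bpm=120, N=3), this gives T=0.5 seconds and M=1.5 seconds.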

In principle, the aforesaid first and second passages are not contiguous. ŝ(t) is then obtained as follows.

Referring to FIG. 2, the segments s_(i)(t) and s_(j)(t) are firstly formed by splitting the audio signal with the aid of a time window h_(L)(t), of width L and defined (of non zero value) between 0 and L. This window may be of rectangular type, of so-called “Hanning” type, of so-called “staircase Hanning” type, or the like. Referring to FIG. 4, a preferred type of time window is obtained by concatenation of a rising flank, of a plateau and of a falling flank. The preferred temporal width of this window is indicated hereinbelow.
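A window made of a rising flank, a plateau and a falling flank can be sketched in discrete form as below. The text does not specify the shape of the flanks, so the linear ramps, the function name and the parameterization in samples are assumptions made for illustration.

```python
def flank_plateau_window(length, flank):
    """Window of `length` samples: linear rise, flat plateau at 1.0, linear fall.

    `flank` is the number of samples in each ramp (assumed linear here).
    """
    if flank < 0 or 2 * flank > length:
        raise ValueError("both flanks must fit inside the window")
    window = []
    for n in range(length):
        if n < flank:
            window.append((n + 1) / (flank + 1))        # rising flank
        elif n >= length - flank:
            window.append((length - n) / (flank + 1))   # falling flank
        else:
            window.append(1.0)                          # plateau
    return window
```

With `flank=0` the function degenerates to a rectangular window, which is one of the alternatives mentioned above.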

The first segment s_(i)(t) is then defined so that:

$s_{i}(t) = s(t + m_{i}) \cdot h_{L}(t)$  [1]

where m_(i) is the start instant of the first segment.

As shown by FIG. 3 a, s_(j)(t) is constructed in substantially the same way:

$s_{j}(t) = s(t + m_{j}) \cdot h_{L}(t)$  [1a]

where m_(j) is the start instant of the second segment.
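Relations [1] and [1a] translate directly into a sample-by-sample product. In this sketch the function name is hypothetical, the signal and the window are plain lists of samples, and the start instant is expressed as a sample index rather than in seconds.

```python
def cut_segment(signal, start, window):
    """Discrete form of s_i(t) = s(t + m_i) * h_L(t).

    `start` plays the role of m_i (as a sample index) and `window` holds the
    samples of h_L, so the returned segment has len(window) samples.
    """
    if start < 0 or start + len(window) > len(signal):
        raise ValueError("the window must lie entirely inside the signal")
    return [signal[start + n] * window[n] for n in range(len(window))]
```

The same function serves for both segments; only the start index, and possibly the window shape, differ between s_(i)(t) and s_(j)(t).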

Even if the duration L of the time window is the same for both segments, it is however indicated that the shape of the window may be different from one segment s_(i)(t) to the other s_(j)(t), as shown moreover by FIG. 2.

Let b_(i) and b_(j) be two respective positions inside the first and second segments, and called the “synchronization positions”, with respect to which the superposition/addition is performed, and such that:

$0 \leq b_{i} \leq L$ and $0 \leq b_{j} \leq L$  [2]

Advantageously, the temporal distance between b_(i) and b_(j) is chosen equal to an integer multiple of the duration T of a beat (b_(j)−b_(i)=kT). Under these conditions, there is said to be a “beat-synchronous” reconstruction if

$\hat{s}(t) = \sum_{i} s_{i}^{\prime}\left(t - (i - 1) \cdot (k^{\prime}T) + c\right)$  [4]

with

$s_{i}^{\prime}(t) = s_{i}(t + b_{i})$  [5]

and where k′ is the largest integer such that k′T≤L−(b_(i)−m_(i)), and c is a time constant such that c=b_(i)−m_(i).

Advantageously, the distance between the instants m_(i) and m_(j) is chosen equal to an integer multiple of k′NT, in which N denotes the numerator of the metric.

Thus, the reconstructed signal may be written:

$\hat{s}(t) = \sum_{i} s_{i}^{\prime}\left(t - (i - 1) \cdot (k^{\prime}NT) + c\right)$

An in-time synchronous superposition/addition is then obtained. FIG. 3 b illustrates this situation. FIG. 4 shows that the width L of the aforesaid time window is approximately k′NT (to within the rising and falling flanks). However, ramps of flanks such that k′T≤L−2(b_(i)−m_(i)) will preferably be chosen in this case.
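The superposition/addition itself can be sketched as a discrete overlap-add in which successive windowed segments are placed one hop apart and summed where they overlap. This is a simplified sketch: the function names are hypothetical, the hop is taken as a whole number of bars converted to samples, and the synchronization position of each segment is assumed to coincide with its start (so that the constant c vanishes).

```python
def bar_hop_samples(bpm, numerator, k_bars, sample_rate):
    """Number of samples in k' bars, i.e. k' * N * T seconds at `sample_rate`."""
    return round(k_bars * numerator * 60.0 / bpm * sample_rate)


def overlap_add(segments, hop):
    """Sum windowed segments, the i-th one shifted by i * hop samples."""
    length = max(i * hop + len(seg) for i, seg in enumerate(segments))
    out = [0.0] * length
    for i, seg in enumerate(segments):
        for n, value in enumerate(seg):
            out[i * hop + n] += value
    return out
```

When each segment is slightly longer than the hop, only the falling flank of one segment and the rising flank of the next are added together, which is what produces the transition between the two passages.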

More particularly, the instants m_(i) and m_(j) are chosen so that they correspond to a first bar time. Under these conditions, a so-called “aligned” beat-synchronous superposition/addition is advantageously obtained.

Thus, by moreover determining the metric of the first passage and/or of the second passage, an in-time beat-synchronous reconstruction can be performed. If, moreover, the first and second segments are chosen so that they commence with a first bar time, this beat-synchronous reconstruction is aligned.

It is indicated that a reconstruction of the signal ŝ(t) may be undertaken on the basis of more than two musical passages to be concatenated. For i musical passages (i>2), the generalization of the above method is expressed by the relation:

$\hat{s}(t) = s_{1}^{\prime}(t + c) + s_{2}^{\prime}(t - k_{1}^{\prime}T + c) + s_{3}^{\prime}(t - k_{1}^{\prime}T + k_{2}^{\prime}T + c) + \ldots + s_{i}^{\prime}\left(t + \sum_{j = 1}^{i} (-1)^{j} k_{j}^{\prime} + T + c\right)$

Each integer k_(j)′ is defined as the largest integer such that k_(j)′T≤L_(j)−(b_(j)−m_(j)), where L_(j) corresponds to the width of the window of the j^(th) musical passage to be concatenated.

It is indicated that the first bar times, or else the metric, or else the tempo of a piece of music, may be detected automatically, for example by using existing software applications. For example, the MPEG-7 standard (Audio Version 2) provides for the determination and the description of the tempo and of the metric of a piece of music, by using such software applications.

Of course, the present invention is not limited to the embodiment described hereinabove by way of example; it extends to other variants.

Thus, it will be understood that the sound resume may comprise more than two musical passages, for example an introduction, a verse and a refrain, or else a pair of passages other than a verse and a refrain, such as the introduction and a refrain, for example.

It will also be noted that the steps represented in flowchart form in FIG. 5 may be implemented by computer software whose algorithm globally recalls the structure of the flowchart. In this regard, the present invention is also aimed at such a computer program.

1-12. (canceled)
13. A method of processing a sound sequence corresponding in particular to a piece of music comprising a succession of subsequences from among at least an introduction, a verse, a refrain, a bridgeway, a theme, a motif, a movement, in which: a) a spectral transform is applied to said sequence to obtain spectral coefficients varying as a function of time in said sequence, b) at least one subsequence repeated in said sequence is determined by statistical analysis of said spectral coefficients, and c) start and end instants of a first subsequence, such as a verse, and of a second subsequence, such as a refrain, are evaluated so as to substantially concatenate the first subsequence with the second subsequence.
14. The method of claim 13 further comprising: d) extracting a repeated subsequence so as to store, in a memory, sound samples representing said subsequence.
15. The method of claim 14, wherein the extraction of d) relates to at least one subsequence whose duration is the biggest and/or one subsequence whose frequency of repetition is the biggest in said sequence.
16. The method of claim 15 wherein the first and the second subsequence are extracted so as to obtain, on a memory medium, a sound resume of said piece of music comprising at least the first subsequence strung together with the second subsequence.
17. The method of claim 16 wherein the extracts of the subsequences are non-contiguous in time, wherein d) includes: d1) detecting at least one cadence of the first subsequence and/or of the second subsequence so as to estimate the mean duration of a bar at said cadence, as well as at least one end segment of the first subsequence and at least one start segment of the second subsequence, of respective durations corresponding substantially to said mean duration and isolated in the sequence by an integer number of mean durations, d2) generating at least one transition bar of duration corresponding to said mean duration and comprising an addition of the sound samples of at least said end segment and of at least said start segment, d3) and concatenating the first subsequence, the transition bar or bars and the second subsequence to obtain a stringing together of the first and of the second subsequence.
18. The method of claim 17 wherein d1) includes a splitting into at least two windows, of rectangular type, of Hanning type, of staircase Hanning type, or preferably of type comprising a flank that rises, a plateau and a flank that descends over time.
19. The method of claim 17 wherein d2) includes a beat-synchronous reconstruction.
20. The method of claim 19 wherein, in d1), the metric of the first subsequence and/or of the second subsequence are/is determined, wherein d2) includes an in-time beat-synchronous reconstruction.
21. The method of claim 19 wherein, in d1), the end and start segments are determined in such a way that they commence with a first bar time, wherein d2) includes an aligned beat-synchronous reconstruction.
22. A computer program product residing on a computer readable medium having a plurality of instructions stored thereon which, when executed by a processor, cause that processor to perform the method of claim 13.